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Abstract 


Spectral  embedding  based  on  the  Singular  Value  Decomposition  (SVD)  is  a 
widely  used  “preprocessing”  step  in  many  learning  tasks,  typically  leading  to  di¬ 
mensionality  reduction  by  projecting  onto  a  number  of  dominant  singular  vectors 
and  rescaling  the  coordinate  axes  (by  a  predefined  function  of  the  singular  value). 
However,  the  number  of  such  vectors  required  to  capture  problem  structure  grows 
with  problem  size,  and  even  partial  SVD  computation  becomes  a  bottleneck.  In 
this  paper,  we  propose  a  low-complexity  compressive  spectral  embedding  algo¬ 
rithm,  which  employs  random  projections  and  finite  order  polynomial  expansions 
to  compute  approximations  to  S  VD-based  embedding.  For  anmxn  matrix  with  T 
non-zeros,  its  time  complexity  is  O  ({T  +  m  +  n)  log(m  +  n)),  and  the  embed¬ 
ding  dimension  is  0(log(m  +  n)),  both  of  which  are  independent  of  the  number 
of  singular  vectors  whose  effect  we  wish  to  capture.  To  the  best  of  our  knowledge, 
this  is  the  first  work  to  circumvent  this  dependence  on  the  number  of  singular  vec¬ 
tors  for  general  SVD-based  embeddings.  The  key  to  sidestepping  the  SVD  is  the 
observation  that,  for  downstream  inference  tasks  such  as  clustering  and  classifica¬ 
tion,  we  are  only  interested  in  using  the  resulting  embedding  to  evaluate  pairwise 
similarity  metrics  derived  from  the  1 2 -norm,  rather  than  capturing  the  effect  of  the 
underlying  matrix  on  arbitrary  vectors  as  a  partial  SVD  tries  to  do.  Our  numerical 
results  on  network  datasets  demonstrate  the  efficacy  of  the  proposed  method,  and 
motivate  further  exploration  of  its  application  to  large-scale  inference  tasks. 

1  Introduction 

Inference  tasks  encountered  in  natural  language  processing,  graph  inference  and  manifold  learning 
employ  the  singular  value  decomposition  (SVD)  as  a  first  step  to  reduce  dimensionality  while  re¬ 
taining  useful  structure  in  the  input.  Such  spectral  embeddings  go  under  various  guises:  Principle 
Component  Analysis  (PCA),  Latent  Semantic  Indexing  (natural  language  processing).  Kernel  Prin¬ 
cipal  Component  Analysis,  commute  time  and  diffusion  embeddings  of  graphs,  to  name  a  few.  In 
this  paper,  we  present  a  compressive  approach  for  accomplishing  SVD-based  dimensionality  reduc¬ 
tion,  or  embedding,  without  actually  performing  the  computationally  expensive  SVD  step. 

The  setting  is  as  follows.  The  input  is  represented  in  matrix  form.  This  matrix  could  represent  the 
adjacency  matrix  or  the  Laplacian  of  a  graph,  the  probability  transition  matrix  of  a  random  walker 
on  the  graph,  a  bag-of-words  representation  of  documents,  the  action  of  a  kernel  on  a  set  of  l  points 
|x(p)  G  :  p  =  1, . . . ,  to}  (kernel  PCA) (T) ^2)  such  as 


A(p,q)  =  e  l|x(p)  x(<t)ll72a2  (or)  A(p,q)  =  /(||x(p)  -  x(g)||  <  a),  1  <p,q<l,  (1) 


where  !(•)  denotes  the  indicator  function  or  matrices  derived  from  A'-nearest-neighbor  graphs  con¬ 
structed  from  |x(p)}.  We  wish  to  compute  a  transformation  of  the  rows  of  this  to  x  n  matrix 
A  which  succinctly  captures  the  global  structure  of  A  via  euclidean  distances  (or  similarity  met¬ 
rics  derived  from  the  l^-norm,  such  as  normalized  correlations).  A  common  approach  is  to  com- 


1 


pute  a  partial  SVD  of  A,  Yl\=i  c/u/v/T;  k  <C  n,  and  to  use  it  to  embed  the  rows  of  A  into  a 
fc-dimensional  space  using  the  rows  of  E  =  [/(<7i)ui  /(<72)u2  •  ■  ■  /(crfc)ufc],  for  some  function 
/(•).  The  embedding  of  the  variable  corresponding  to  the  /-th  row  of  the  matrix  A  is  the  /-th  row  of 
E.  For  example,  f(x)  =  x  corresponds  to  Principal  Component  Analysis  (PCA):  the  fc-dimensional 
rows  of  E  are  projections  of  the  //-dimensional  rows  of  A  along  the  first  k  principal  components, 
{v;,  l  =  1, . . . ,  k}.  Other  important  choices  include  f(x)  =  constant  used  to  cut  graphs  J5)  and 
f(x)  =  1/vT  —  x  for  commute  time  embedding  of  graphs  SI-  Inference  tasks  such  as  (unsuper¬ 
vised)  clustering  and  (supervised)  classification  are  performed  using  £2  -based  pairwise  similarity 
metrics  on  the  embedded  coordinates  (rows  of  E)  instead  of  the  ambient  data  (rows  of  A). 

Beyond  the  obvious  benefit  of  dimensionality  reduction  from  n  to  k,  embeddings  derived  from  the 
leading  partial-S  VD  can  often  be  interpreted  as  denoising,  since  the  “noise”  in  matrices  arising  from 
real-world  data  manifests  itself  via  the  smaller  singular  vectors  of  A  (e.g.,  see  J5],  which  analyzes 
graph  adjacency  matrices).  This  is  often  cited  as  a  motivation  for  choosing  PCA  over  “isotropic” 
dimensionality  reduction  techniques  such  as  random  embeddings,  which,  under  the  setting  of  the 
Johnson-Lindenstrauss  (JL)  lemma,  can  also  preserve  structure. 

The  number  of  singular  vectors  k  needed  to  capture  the  structure  of  an  m  x  n  matrix  grows  with  its 
size,  and  two  bottlenecks  emerge  as  we  scale:  (a)  The  computational  effort  required  to  extract  a  large 
number  of  singular  vectors  using  conventional  iterative  methods  such  as  Lanczos  or  simultaneous 
iteration  or  approximate  algorithms  like  Nystrom  (6j,  f7)  and  Randomized  SVD  |]8)  for  computation 
of  partial  SVD  becomes  prohibitive  (scaling  as  O(fcT),  where  T  is  the  number  of  non-zeros  in  A) 
(b)  the  resulting  /, '-dimensional  embedding  becomes  unwieldy  for  use  in  subsequent  inference  steps. 

Approach  and  Contributions:  In  this  paper,  we  tackle  these  scalability  bottlenecks  by  focusing  on 
what  embeddings  are  actually  used  for:  computing  (2-based  pairwise  similarity  metrics  typically 
used  for  supervised  or  unsupervised  learning.  For  example,  K-means  clustering  uses  pairwise  Eu¬ 
clidean  distances,  and  S  VM-based  classification  uses  pairwise  inner  products.  We  therefore  ask  the 
following  question:  “Is  it  possible  to  compute  an  embedding  which  captures  the  pairwise  euclidean 
distances  between  the  rows  of  the  spectral  embedding  E  =  [/(<ti)ui  •  •  •  f(<Tk)  Ufc],  while  sidestep¬ 
ping  the  computationally  expensive  partial  SVD?”  We  answer  this  question  in  the  affirmative  by 
presenting  a  compressive  algorithm  which  directly  computes  a  low-dimensional  embedding. 

There  are  two  key  insights  that  drive  our  algorithm: 

•  By  approximating  /(cr)  by  a  low-order  (/,  <C  min {to,  //})  polynomial,  we  can  compute  the  em¬ 
bedding  iteratively  using  matrix- vector  products  of  the  form  Aq  or  AT  q. 

•  The  iterations  can  be  computed  compressively:  by  virtue  of  the  celebrated  JL  lemma,  the  em¬ 
bedding  geometry  is  approximately  captured  by  a  small  number  d  =  0(log(ro  +  n))  of  randomly 
picked  starting  vectors. 

The  number  of  passes  over  A,  AT  and  time  complexity  of  the  algorithm  are  L,  L  and  (){L(T  +  m  + 
n)  log(m  +  n))  respectively.  These  are  all  independent  of  the  number  of  singular  vectors  k  whose 
effect  we  wish  to  capture  via  the  embedding.  This  is  in  stark  contrast  to  embedding  directly  based  on 
the  partial  SVD.  Our  algorithm  lends  itself  to  parallel  implementation  as  a  sequence  of  2 L  matrix- 
vector  products  interlaced  with  vector  additions,  run  in  parallel  across  d  =  0(log(m  +  n))  randomly 
chosen  starting  vectors.  This  approach  significantly  reduces  both  computational  complexity  and 
embedding  dimensionality  relative  to  partial  SVD.  A  freely  downloadable  Python  implementation 
of  the  proposed  algorithm  that  exploits  this  inherent  parallelism  can  be  found  in  (9j- 


2  Related  work 

As  discussed  in  Section  [TT!  the  concept  of  compressive  measurements  forms  a  key  ingredient  in  our 
algorithm,  and  is  based  on  the  JL  lemma  IfTOl.  The  latter,  which  provides  probabilistic  guarantees  on 
approximate  preservation  of  the  Euclidean  geometry  for  a  finite  collection  of  points  under  random 
projections,  forms  the  basis  for  many  other  applications,  such  as  compressive  sensing  nm 

We  now  mention  a  few  techniques  for  exact  and  approximate  SVD  computation,  before  discussing 
algorithms  that  sidestep  the  SVD  as  we  do.  The  time  complexity  of  the  full  SVD  of  an  to  x  n 
matrix  is  0{mn2)  (for  to  >  n).  Partial  SVDs  are  computed  using  iterative  methods  for  eigen 
decompositions  of  symmetric  matrices  derived  from  A  such  as  AAT  and  [0  A 7  :  A  0]  lU2l.  The 
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complexity  of  standard  iterative  eigensolvers  such  as  simultaneous  iteration lfl3l  and  the  Lanczos 
method  scales  as  fl(kT)  (12).  where  T  denotes  the  number  of  non-zeros  of  A. 

The  leading  k  singular  value,  vector  triplets  {(07,  tp,  v/),  l  =  1  minimize  the  matrix 

reconstruction  error  under  a  rank  k  constraint:  they  are  a  solution  to  the  optimization  problem 
argmin||  A  —  Yl\=i  <jiui'vT\\‘f’  where  ||  •  ||i?  denotes  the  Frobenius  norm.  Approximate  SVD  algo¬ 
rithms  strive  to  reduce  this  error  while  also  placing  constraints  on  the  computational  budget  and/or 
the  number  of  passes  over  A.  A  commonly  employed  approximate  eigendecomposition  algorithm 
is  the  Nystrom  method  [j6|,  (7)  based  on  random  sampling  of  s  columns  of  A,  which  has  time  com¬ 
plexity  0(ksn  +  s3).  A  number  of  variants  of  the  Nystrom  method  for  kernel  matrices  like  (]7}  have 
been  proposed  in  the  literature.  These  aim  to  improve  accuracy  using  preprocessing  steps  such  as 
I\  -means  clustering  03  or  random  projection  trees  03.  Methods  to  reduce  the  complexity  of  the 
Nystrom  algorithm  to  0(ksn  +  /.;3 )  ||T6l ,  (17)  enable  Nystrom  sketches  that  see  more  columns  of  A. 
The  complexity  of  all  of  these  grow  as  f l(ksn).  Other  randomized  algorithms,  involving  iterative 
computations,  include  the  Randomized  SVD  (8).  Since  all  of  these  algorithms  set  out  to  recover 
/.■-leading  eigenvectors  (exact  or  otherwise),  their  complexity  scales  as  V.(kT). 

We  now  turn  to  algorithms  that  sidestep  SVD  computation.  In  (18),  (19),  vertices  of  a  graph  are 
embedded  based  on  diffusion  of  probability  mass  in  random  walks  on  the  graph,  using  the  power 
iteration  run  independently  on  random  starting  vectors,  and  stopping  “prior  to  convergence.”  While 
this  approach  is  specialized  to  probability  transition  matrices  (unlike  our  general  framework)  and 
does  not  provide  explicit  control  on  the  nature  of  the  embedding  as  we  do,  a  feature  in  common  with 
the  present  paper  is  that  the  time  complexity  of  the  algorithm  and  the  dimensionality  of  the  resulting 
embedding  are  independent  of  the  number  of  eigenvectors  k  captured  by  it.  A  parallel  implementa¬ 
tion  of  this  algorithm  was  considered  in  (20);  similar  parallelization  directly  applies  to  our  algorithm. 
Another  specific  application  that  falls  within  our  general  framework  is  the  commute  time  embedding 
on  a  graph,  based  on  the  normalized  adjacency  matrix  and  weighing  function  f(x)  =  1/ \/l  —  x  E), 
ED.  Approximate  commute  time  embeddings  have  been  computed  using  Spielman-Teng  solvers 
(22),  (23)  and  the  JL  lemma  in  |24).  The  complexity  of  the  latter  algorithm  and  the  dimensionality  of 
the  resulting  embedding  are  comparable  to  ours,  but  the  method  is  specially  designed  for  the  normal¬ 
ized  adjacency  matrix  and  the  weighing  function  f(x)  =  I  / \ff~-  x.  Our  more  general  framework 
would,  for  example,  provide  the  flexibility  of  suppressing  small  eigenvectors  from  contributing  to 
the  embedding  (e.g,  by  setting  f(x)  =  I(x  >  e)/Vl  —  x). 

Thus,  while  randomized  projections  are  extensively  used  in  the  embedding  literature,  to  the  best 
of  our  knowledge,  the  present  paper  is  the  first  to  develop  a  general  compressive  framework  for 
spectral  embeddings  derived  from  the  SVD.  It  is  interesting  to  note  that  methods  similar  to  ours 
have  been  used  in  a  different  context,  to  estimate  the  empirical  distribution  of  eigenvalues  of  a  large 
hermitian  matrix  (25).  (26).  These  methods  use  a  polynomial  approximation  of  indicator  functions 
/(A)  =  I(a  <  A  <  b)  and  random  projections  to  compute  an  approximate  histogram  of  the  number 
of  eigenvectors  across  different  bands  of  the  spectrum:  [a,  b]  C  [Amjn,  Amax], 


3  Algorithm 


We  first  present  the  algorithm  for  a  symmetric  n  x  n  matrix  S.  Later,  in  Section  13.51  we  show 
how  to  handle  a  general  m  x  n  matrix  by  considering  a  related  (m  +  n)  x  (m  +  n)  symmetric 
matrix.  Let  A /  denote  the  eigenvalues  of  S  sorted  in  descending  order  and  v/  their  corresponding 
unit-norm  eigenvectors  (chosen  to  be  orthogonal  in  case  of  repeated  eigenvalues).  For  any  func¬ 
tion  g(x)  :  R  1— >  R,  we  denote  by  g(S)  the  n  x  n  symmetric  matrix  g(S)  =  Y^iZi  <?(Az)v;v^. 
We  now  develop  an  O(nlogn)  algorithm  to  compute  a  d  =  O(logn)  dimensional  embedding 
which  approximately  captures  pairwise  euclidean  distances  between  the  rows  of  the  embedding 
E  =  [.f  (Ar)  vi  /  (A2)  v2  •  •  •  /  (An)  v„]. 

Rotations  are  inconsequential:  We  first  observe  that  rotation  of  basis  does  not  alter  /2-based  sim¬ 
ilarity  metrics.  Since  V  =  [vi  •  •  •  v„]  satisfies  VVT  =  VTV  =  Ira,  pairwise  distances  be¬ 
tween  the  rows  of  E  are  equal  to  corresponding  pairwise  distances  between  the  rows  of  EVr  = 
YfjiZ  1  /(^/)v/vF  =  f(S)-  We  use  this  observation  to  compute  embeddings  of  the  rows  of  f(S) 
rather  than  those  of  E. 
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3.1  Compressive  embedding 

Suppose  now  that  we  know  f(S).  This  constitutes  an  n-dimensional  embedding,  and  similarity 
queries  between  two  “vertices”  (we  refer  to  the  variables  corresponding  to  rows  of  S  as  vertices, 
as  we  would  for  matrices  derived  from  graphs)  requires  0{n)  operations.  However,  we  can  reduce 
this  time  to  O(logn)  by  using  the  JL  lemma,  which  informs  us  that  pairwise  distances  can  be 
approximately  captured  by  compressive  projection  onto  d  =  0(log  n)  dimensions. 

Specifically,  for  d  >  (4  +  2/3)  logn/(e2/2  —  e3/3) ,  let  Q  denote  an  nx  d  matrix  with  i.i.d.  entries 
drawn  uniformly  at  random  from  {±1/ y/d}.  According  to  the  JL  lemma,  pairwise  distances  between 
the  rows  of  f(S)Ll  approximate  pairwise  distances  between  the  rows  of  f(S)  with  high  probability. 
In  particular,  the  following  statement  holds  with  probability  at  least  1  —  n_/3:  (1  —  e)  ||u  —  v||  < 
|| (u  —  v)  U||2  <  (1  +  e)  || u  —  v||2,  for  any  two  rows  u,  v  of  f(S). 

The  key  take-aways  are  that  (a)  we  can  reduce  the  embedding  dimension  to  d  =  0(log  n),  since  we 
are  only  interested  in  pairwise  similarity  measures,  and  (b)  We  do  not  need  to  compute  f(S).  We 
only  need  to  compute  f(S)Q.  We  now  discuss  how  to  accomplish  the  latter  efficiently. 

3.2  Polynomial  approximation  of  embedding 

Direct  computation  of  E'  =  from  the  eigenvectors  and  eigenvalues  of  S,  as  f(S)  = 

/(Az)v;v^  would  suggest,  is  expensive  ( 0(n 3)).  However,  we  now  observe  that  computation 
of  fi(S)Ll  is  easy  when  f>f)  is  a  polynomial.  In  this  case,  ip(S)  =  Y2PZ o  bpSp  for  some  bp  €  M,  so 
that  il>(S)Q  can  be  computed  as  a  sequence  of  L  matrix-vector  products  interlaced  with  vector  ad¬ 
ditions  run  in  parallel  for  each  of  the  d  columns  of  f l.  Therefore,  they  only  require  LdT  +  O(  Ldn) 
flops.  Our  strategy  is  to  approximate  E'  =  f(S)Ll  by  E  =  fp(S)Ll,  where  /l(x)  is  an  L-th  order 
polynomial  approximation  of  f(x).  We  defer  the  details  of  computing  a  “good”  polynomial  approx¬ 
imation  to  Section l3~4l  For  now,  we  assume  that  one  such  approximation  /l(-)  is  available  and  give 
bounds  on  the  loss  in  fidelity  as  a  result  of  this  approximation. 


3.3  Performance  guarantees 


The  spectral  norm  of  the  “error  matrix”  Z  =  f(S)  —  f(S)  =  Er=i(/(^)_/t(^))vrvr  satisfies 
||Z||  =  6  =  max;|/(Az)  —  /l(A/)|  <  max{|/(cc)  —  fh(x) |},  where  the  spectral  norm  of  a  matrix 
B,  denoted  by  ||f3||  refers  to  the  induced  f^-norm.  For  symmetric  matrices,  ||.B||  <  a  •<=>•  |A;|  < 
a  V/,  where  A /  are  the  eigenvalues  of  B.  Letting  ip  denote  the  unit  vector  along  the  p-th  coordinate 
of  M",  the  distance  between  the  p ,  g-th  rows  of  f(S)  can  be  written  as 

II 7l(S)  (ip  ^  i,)||  =  II  f(S)  (ip  -  i q)-z  (ip  -  i,)||  <  \\ET  (ip  -  i,)||  +  6y/2.  (2) 

Similarly,  we  have  that  || /l{S)  (ip  —  ig)||  >  ||i3T  (ip  —  i9)||  —  d\[2.  Thus  pairwise  distances  be¬ 
tween  the  rows  of  fp  (S)  approximate  those  between  the  rows  of  E.  However,  the  distortion  term 
5\/2  is  additive  and  must  be  controlled  by  carefully  choosing  /l(-),  as  discussed  in  SectionQ] 

Applying  the  JL  lemma  fTOl  to  the  rows  of  /l{S),  we  have  that  when  d  >  O  ( e~ 2  logn)  with  i.i.d. 
entries  drawn  uniformly  at  random  from  {±1  /y/d},  the  embedding  E  =  captures  pairwise 

distances  between  the  rows  of  / l(S )  up  to  a  multiplicative  distortion  of  1  ±  e  with  high  probability: 


ET  (ip  —  iQ)  =  flT  fL(S)  (ip  -  ig) 


<  VT 


h(S)  (ip-i9) 


Using  ©,  we  can  show  that  \\ET  (ip  —  i9)||  <  y/1  +  e  (|| ET  (ip  —  iq)||  +  Sy/2 ).  Similarly, 
||£'t  (ip  —  iq)||  >  y/1  —  e  ( \\ET  (ip  —  ig)||  —  5y/2).  We  state  this  result  in  TheoremQ] 

Theorem  1.  Let  /l^x)  denote  an  L-th  order  polynomial  such  that:  8  =  maxi\f(Xi)  —  /l(A;)|  < 
max\f{x)  —  /l(x)  \  and  Q  an  n  x  d  matrix  with  entries  drawn  independently  and  uniformly  at 
random  from  {±l/-\/d},  where  d  is  an  integer  satisfying  d  >  (4  +  2/3)  logn/(e2/2  —  e3/3).  Let 
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g  :  Rp  — >  denote  the  mapping  from  the  i-th  row  of  E  =  [f  (Ai)  Vi  •  •  •  /  (An)  v„]  to  the  i-th 

row  of  E  =  fL(S)fl.  The  following  statement  is  true  with  probability  at  least  1  —  n 

Vl  -  e(||u  -  w||  -  S\/2)  <  ||g(u)  -  g(u)||  <  Vl  +  e(||n  -  u||  +  SV2) 
for  any  tw’o  rows  u,v  of  E.  Furthermore,  there  exists  an  algorithm  to  compute  each  of  the  d  = 
O(logn)  columns  of  E  in  0(L(T  +  n))  flops  independent  of  its  other  columns  which  makes  L 
passes  over  S  (T  is  the  number  of  non-zeros  in  S). 

3.4  Choosing  the  polynomial  approximation 

We  restrict  attention  to  matrices  which  satisfy  ||<S||  <  1,  which  implies  that  |A;|  <  1.  We  observe 
that  we  can  trivially  center  and  scale  the  spectrum  of  any  matrix  to  satisfy  this  assumption  when 
we  have  the  following  bounds:  A i  <  crmax  and  A;  >  <rmjn  via  the  rescaling  and  centering  operation 
given  by:  S'  =  2 S/(amm  -  a^n)  -  (<jmax  +  crmin)  I„/(crmax  -  <rmin)  and  by  modifying  /( x)  to 
f  (*^)  =  (^max  &mm)/2  T  (crmax  -f-  CTm in)/2). 

In  order  to  compute  a  polynomial  approximation  of  /(x),  we  need  to  define  the  notion  of  “good” 
approximation.  We  showed  in  Section  13.31  that  the  errors  introduced  by  the  polynomial  approx¬ 
imation  can  be  summarized  by  furnishing  a  bound  on  the  spectral  norm  of  the  error  matrix 
Z  =  f(S )  —  /l(5'):  Since  ||Z||  =  <5  =  max//(A;)  —  /l(Aj)|,  what  matters  is  how  well  we 
approximate  the  function  /(•)  at  the  eigenvalues  { A/ }  of  S.  Indeed,  if  we  know  the  eigenvalues, 
we  can  minimize  \\Z\\  by  minimizing  max;|/(A;)  —  /l(A;)|.  This  is  not  a  particularly  useful  ap¬ 
proach,  since  computing  the  eigenvalues  is  expensive.  However,  we  can  use  our  prior  knowledge 
of  the  domain  from  which  the  matrix  S  comes  from  to  penalize  deviations  from  /(A)  differently 
for  different  values  of  A.  For  example,  if  we  know  the  distribution  p{x)  of  the  eigenvalues  of 
S,  we  can  minimize  the  average  error  p(A)|/(A)  —  /z,(A)|2da;.  In  our  examples,  for 

the  sake  of  concreteness,  we  assume  that  the  eigenvalues  are  uniformly  distributed  over  [—1,1] 
and  give  a  procedure  to  compute  an  L- th  order  polynomial  approximation  of  /(.t)  that  minimizes 
al  =  (1/2)  f\  |  f(x)  -  /L(x)|2da;. 

A  numerically  stable  procedure  to  generate  finite  order  polynomial  approximations  of  a  function 
over  [—1, 1]  with  the  objective  of  minimizing  //  (  |  f(x)  —  /i(x)|2da:  is  via  Legendre  polynomials 
p(r,x),  r  =  0,1 ,L.  They  satisfy  the  recursion  p(r,x)  =  (2  —  l/r)xp(r  —  l,x)  —  (1  — 
l/r)p(r  —  2,x)  and  are  orthogonal:  f^1p(k,x)p(l,x) Ax  =  2 1(k  =  l)/(2r  +  1).  Therefore  we 

set  /l(x)  =  Ylr=o  a(r)p(.ri x)  where  a(r)  =  (r  +  1/2)  p(r,  x)f(x)dx.  We  give  a  method 
in  Algorithm Q] that  uses  the  Legendre  recursion  to  compute  p(r,  S)(A.  r  =  0,1 ,L  using  Ld 
matrix-vector  products  and  vector  additions.  The  coefficients  a(r)  are  used  to  compute  fL{S)fl  by 
adding  weighted  versions  of  p(r,  S)£l. 


Algorithm  1  Proposed  algorithm  to  compute  approximate  d-dimensional  eigenvector  embedding  of 
an  x  n  symmetric  matrix  S  (such  that  ||S'||  <  1)  using  the  n  x  d  random  projection  matrix  S2. 

1 

Procedure  FastEmbedEIG(S',  /(x),  L,  fl): 

2 

II*  Compute  polynomial  approximation  /l(x)  which  minimizes  f  ,  |/(x)  —  /i(x)|2dx  *// 

3 

for  r  =  0, . . . ,  L  do 

4 

a(r)  <!-  (r  +  1/2)  /(x)p(r,  x)dx  II*  p{r , 

x):  Order  r  Legendre  polynomial  *// 

5 

Q{ 0)  <-  a  Q(— 1)  <-  0,  E  <r-  a(0)Q(0) 

6 

for  r  =  1, 2, . . . ,  L  do 

7 

Q(r)  <-  (2  -  Vr)SQ(r  -  1)  -  (1  -  l/r)Q(r  -  2) 

II*  Q{r)  =  p(r,  S)fl  *// 

8 

E  4—  E  +  a(r)Q(r) 

II*  E  now  holds  fr(S)H  *// 

9 

return  E 

II*  E  =  Jl{S)VL  *// 

As  described  in  Section@]  if  we  have  prior  knowledge  of  the  distribution  of  eigenvalues  (as  we  do  for 
many  commonly  encountered  large  matrices),  then  we  can  “boost”  the  performance  of  the  generic 
Algorithm[T|based  on  the  assumption  of  eigenvalues  uniformly  distributed  over  [—1, 1]. 
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3.5  Embedding  general  matrices 


We  complete  the  algorithm  description  by  generalizing  to  any  m  x  n  matrix  A  (not  neces¬ 
sarily  symmetric)  such  that  ||A||  <  1.  The  approach  is  to  utilize  Algorithm  [T]  to  compute 
an  approximate  d-dimensional  embedding  of  the  symmetric  matrix  S  =  [0  AT ;  A  0] .  Let 
{(cr/,u/,  vz)  :  l  =  1,.. .  ,min{m,n}}  be  an  SVD  of  A  =  J2icriui'vT  (ll^ll  <  1  <=>• 

t i  <  1).  Consider  the  following  spectral  mapping  of  the  rows  of  A  to  the  rows  of  Erow  = 
[f Oi)ui  •••  /(  <rm) um]  and  the  columns  of  A  to  the  rows  of  Eco i  =  [/(cri)vi  •  •  •  f(an)vn]. 
It  can  be  shown  that  the  unit-norm  orthogonal  eigenvectors  of  S  take  the  form  [v/;u;]/\/2 
and  [v;;  —  ui]/y/2,  l  =  1, . . . ,  min{m,  n},  and  their  corresponding  eigenvalues  are  er /  and  —  a 
respectively.  The  remaining  | m  —  n  eigenvalues  of  S  are  equal  to  0.  Therefore,  we  call 
-Eaii  -s—  FastEmbedEIG(S',  L,  Cl)  with  f(x)  =  f(x)I(x  >  0)  —  f(—x)I(x  <  0)  and  fl 
is  an  (to  +  n)  x  d,d  =  0(log(m  +  n))  matrix  (entries  drawn  independently  and  uniformly  at  ran¬ 
dom  from  {±1/ Vd}).  Let  L’col  and  Brow  denote  the  first  n  and  last  to  rows  of  /i’an-  From  TheoremU] 
we  know  that,  with  overwhelming  probability,  pairwise  distances  between  any  two  rows  of  Erow  ap¬ 
proximates  those  between  corresponding  rows  of  EIOV/.  Similarly,  pairwise  distances  between  any 
two  rows  of  Eco i  approximates  those  between  corresponding  rows  of  Eco\. 


4  Implementation  considerations 

We  now  briefly  go  over  implementation  considerations  before  presenting  numerical  results  in  Sec¬ 
tion  0 

Spectral  norm  estimates  In  order  to  ensure  that  the  eigenvalues  of  S  are  within  [—1,1]  as  we  have 
assumed,  we  scale  the  matrix  by  its  spectral  norm  (||5||  =  max|A;|).  To  this  end,  we  obtain  a  tight 
lower  bound  (and  a  good  approximation)  on  the  spectral  norm  using  power  iteration  (20  iterates  on 
61ogn  randomly  chosen  starting  vectors),  and  then  scale  this  up  by  a  small  factor  (1.01)  for  our 
estimate  (typically  an  upper  bound)  for  ||5||. 

Polynomial  approximation  order  L:  The  error  in  approximating  /(A)  by  /l(A),  as  measured  by 
Al  =  f_1  |/(x)  —  /i(x) |2dx  is  a  non-increasing  function  of  the  polynomial  order  L.  Reduction 
in  A/,  often  corresponds  to  a  reduction  in  S  that  appears  as  a  bound  on  distortion  in  Theorem  [I] 
“Smooth”  functions  generally  admit  a  lower  order  approximation  for  the  same  target  error  A^,  and 
hence  yield  considerable  savings  in  algorithm  complexity,  which  scales  linearly  with  L. 

Polynomial  approximation  method:  The  rate  at  which  <5  decreases  as  we  increase  L  depends  on 
the  function  p(X)  used  to  compute  /l(A)  (by  minimizing  A^  =  f  p(A)|/(A)  —  /^( A)|2dx).  The 
choice  p( A)  oc  1  yields  the  Legendre  recursion  used  in  AlgorithmQ]  whereas  p( A)  oc  1/y/l  —  A2 
corresponds  to  the  Chebyshev  recursion,  which  is  known  to  result  in  fast  convergence.  We  defer  to 
future  work  a  detailed  study  of  the  impact  of  alternative  choices  for  p( A)  on  5. 

Denoising  by  cascading  In  large-scale  problems,  it  may  be  necessary  to  drive  the  contribution  from 
certain  singular  vectors  to  zero.  In  many  settings,  singular  vectors  with  smaller  singular  values  cor¬ 
respond  to  noise.  The  number  of  such  singular  values  can  scale  as  fast  as  0(min{m,  n}).  Therefore, 
when  we  place  nulls  (zeros)  in  /(A),  it  is  desirable  to  ensure  that  these  nulls  are  pronounced  after 

we  approximate  /(A)  by  /l(A).  We  do  this  by  computing  (gL/b(S))b  Q,  where  gL/b( A)  is  an  L/b- 
th  order  approximation  of  g(X)  =  /  '  j'b(X).  The  small  values  in  the  polynomial  approximation  of 
/1/h( A)  which  correspond  to  /(A)  =  0  (nulls  which  we  have  set)  get  amplified  when  we  pass  them 
through  the  xb  non-linearity. 

5  Numerical  results 

While  the  proposed  approach  is  particularly  useful  for  large  problems  in  which  exact  eigendecom- 
position  is  computationally  infeasible,  for  the  purpose  of  comparison,  our  results  are  restricted  to 
smaller  settings  where  the  exact  solution  can  be  computed.  We  compute  the  exact  partial  eigende- 
composition  using  the  ARPACK  library  (called  from  MATLAB).  For  a  given  choice  of  weighing 


6 


20  40  60  80  100  120 

d 


(a)  Effect  of  dimensionality  d  of  embedding 


(b)  Effect  of  cascading:  b  =  l(left)  and  b  =  2  (right) 


Figure  1 :  DBLP  collaboration  network  normalized  correlations 


function  /(A),  the  associated  embedding  E  =  [/(Ai)vi  •  •  •  /(An)v„]  is  compared  with  the  com¬ 
pressive  embedding  E  returned  by  Algorithm  [I]  The  latter  was  implemented  in  Python  using  the 
Scipy’s  sparse  matrix-multiplication  routines  and  is  available  for  download  from  f9|. 

We  consider  two  real  world  undirected  graphs  in  f27)  for  our  evaluation,  and  compute  embeddings 
for  the  normalized  adjacency  matrix  A  (=  D~1/2AD~1/2,  where  D  is  a  diagonal  matrix  with  row 
sums  of  the  adjacency  matrix  A;  the  eigenvalues  of  A  lie  in  [—1, 1])  for  graphs.  We  study  the  ac¬ 
curacy  of  embeddings  by  comparing  pairwise  normalized  correlations  between  i,j- th  rows  of  E 
given  by  <  E(i,  :),E(j, :)  >/|| E(i, :) || || J57 (j,  :)||  with  those  predicted  by  the  approximate  embed¬ 
ding  <  E(i, :),  E(j, :)  >  /\\E(i,  :)||||£J(i7,  :)||  (E(i, :)  is  short-hand  for  the  i-th  row  of  E). 

DBLP  collaboration  network  ||27l  is  an  undirected  graph  on  n  =  317080  vertices  with  1049866 
edges.  We  compute  the  leading  500  eigenvectors  of  the  normalized  adjacency  matrix  A.  The  small¬ 
est  of  the  five  hundred  eigenvalues  is  0.98,  so  we  set  /(A)  =  /(A  >  0.98)  and  S  —  A  in  Algorithm!]] 
and  compare  the  resulting  embedding  E  with  E  =  [v i  •  ■  •  V500].  We  demonstrate  the  dependence 
of  the  quality  of  the  embedding  E  returned  by  the  proposed  algorithm  on  two  parameters:  (i)  num¬ 
ber  of  random  starting  vectors  d ,  which  gives  the  dimensionality  of  the  embedding  and  (ii)  the 
boosting/cascading  parameter  b  using  this  dataset. 

Dependence  on  the  number  of  random  projections  d:  In  Figure  (Hat,  d  ranges  from  1  to  120  ~  9  log  n 
and  plot  the  1-st,  5-th,  25-th,  50-th,  75-th,  95-th  and  99-th  percentile  values  of  the  deviation  between 
the  compressive  normalized  correlation  (from  the  rows  of  E)  and  the  corresponding  exact  normal¬ 
ized  correlation  (rows  of  E ).  The  deviation  decreases  with  increasing  d ,  corresponding  to  f^-norm 
concentration  (JL  lemma),  but  this  payoff  saturates  for  large  values  of  d  as  polynomial  approxima¬ 
tion  errors  start  to  dominate.  From  the  5-th  and  95-th  percentile  curves,  we  see  that  a  significant 
fraction  (90%)  of  pairwise  normalized  correlations  in  E  lie  within  ±0.2  of  their  corresponding  val¬ 
ues  in  E  when  d  =  80  ~  6  log  n.  For  Figure  (flail,  we  use  L  =  180  matrix-vector  products  for  each 
randomly  picked  starting  vector  and  set  cascading  parameter  6  =  2  for  the  algorithm  in  Section!]] 

Dependence  on  cascading  parameter  b:  In  Section|4]we  described  how  cascading  can  help  suppress 
the  contribution  to  the  embedding  E  of  the  eigenvectors  whose  eigenvalues  lie  in  regions  where  we 
have  set  /(A)  =  0.  We  illustrate  the  importance  of  this  boosting  procedure  by  comparing  the  quality 
of  the  embedding  E  for  6=1  and  6  =  2  (keeping  the  other  parameters  of  the  algorithm  in  Sectionf]] 
fixed:  L  =  180  matrix-vector  products  for  each  of  d  =  80  randomly  picked  starting  vectors). 
We  report  the  results  in  Figure  (ITbli  where  we  plot  percentile  values  of  compressive  normalized 
correlation  (from  the  rows  of  E)  for  different  values  of  the  exact  normalized  correlation  (rows  of 
E).  For  6=1,  the  polynomial  approximation  of  /(A)  does  not  suppress  small  eigenvectors.  As  a 
result,  we  notice  a  deviation  (bias)  of  the  50-percentile  curve  (green)  from  the  ideal  y  =  x  dotted 
line  drawn  (Figure  [Tb]  left).  This  disappears  for  6  =  2  (Figure  [Tb]  right). 

The  running  time  for  our  algorithm  on  a  standard  workstation  was  about  two  orders  of  magnitude 
smaller  than  partial  S  VD  using  off-the-shelf  sparse  eigensolvers  (e.g.,  the  80  dimensional  embedding 
of  the  leading  500  eigenvectors  of  the  DBLP  graph  took  1  minute  whereas  their  exact  computation 
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took  105  minutes).  A  more  detailed  comparison  of  running  times  is  beyond  the  scope  of  this  paper, 
but  it  is  clear  that  the  promised  gains  in  computational  complexity  are  realized  in  practice. 

Application  to  graph  clustering  for  the  Amazon  co-purchasing  network  ll27l  :  This  is  an  undi¬ 
rected  graph  on  n  =  334863  vertices  with  925872  edges.  We  illustrate  the  potential  downstream 
benefits  of  our  algorithm  by  applying  K -means  clustering  on  embeddings  (exact  and  compressive) 
of  this  network.  For  the  purpose  of  our  comparisons,  we  compute  the  first  500  eigenvectors  for  A  ex¬ 
plicitly  using  an  exact  eigensolver,  and  use  an  80-dimensional  compressive  embedding  E  which  cap¬ 
tures  the  effect  of  these,  with  /(A)  =  /(A  >  A500),  where  A500  is  the  500th  eigenvalue.  We  compare 
this  against  the  usual  spectral  embedding  using  the  first  80  eigenvectors  of  A:  E  =  [vi  ■  ■  ■  v80]. 
We  keep  the  dimension  fixed  at  80  in  the  comparison  because  /T-means  complexity  scales  linearly 
with  it,  and  quickly  becomes  the  bottleneck.  Indeed,  our  ability  to  embed  a  large  number  of  eigen¬ 
vectors  directly  into  a  low  dimensional  space  (d  ps  (i  log  n)  has  the  added  benefit  of  dimensionality 
reduction  within  the  subspace  of  interest  (in  this  case  the  largest  500  eigenvectors). 

We  consider  25  instances  of  A’ -means  clustering  with  K  =  200  throughout,  reporting  the  median 
of  a  commonly  used  graph  clustering  score,  modularity  Il28l  (larger  values  translate  to  better  clus¬ 
tering  solutions).  The  median  modularity  for  clustering  based  on  our  embedding  E  is  0.87.  This  is 
significantly  better  than  that  for  E,  which  yields  median  modularity  of  0.835.  In  addition,  the  com¬ 
putational  cost  for  E  is  one-fifth  that  for  E  (1.5  minutes  versus  10  minutes).  When  we  replace  the 
exact  eigenvector  embedding  E  with  approximate  eigendecomposition  using  Randomized  SVD  {8} 
(parameters:  power  iterates  q  =  5  and  excess  dimensionality  l  =  10),  the  time  taken  reduces  from 
10  minutes  to  17  seconds,  but  this  comes  at  the  expense  of  inference  quality:  median  modularity 
drops  to  0.748.  On  the  other  hand,  the  median  modularity  increases  to  0.845  when  we  consider  ex¬ 
act  partial  SVD  embedding  with  120  eigenvectors.  This  indicates  that  our  compressive  embedding 
yields  better  clustering  quality  because  it  is  able  to  concisely  capture  more  eigenvectors(500  in  this 
example,  compared  to  80  and  120  with  conventional  partial  SVD).  It  is  worth  pointing  out  that,  even 
for  known  eigenvectors,  the  number  of  dominant  eigenvectors  k  that  yields  the  best  inference  per¬ 
formance  is  often  unknown  a  priori,  and  is  treated  as  a  hyper-parameter.  For  compressive  spectral 
embedding  E,  an  elegant  approach  for  implicitly  optimizing  over  k  is  to  use  the  embedding  function 
/(A)  =  /(A  >  c),  with  c  as  a  hyper-parameter. 


6  Conclusion 

We  have  shown  that  random  projections  and  polynomial  expansions  provide  a  powerful  approach  for 
spectral  embedding  of  large  matrices:  for  an  mxn  matrix  A,  our  0((T+m  +  n)  log  (m  +  n))  algo¬ 
rithm  computes  an  O  (log  ( m  +  n ) )  -  d  i  m  e  n  s  i  o  n  a  I  compressive  embedding  that  provably  approximates 
pairwise  distances  between  points  in  the  desired  spectral  embedding.  Numerical  results  for  several 
real-world  data  sets  show  that  our  method  provides  good  approximations  for  embeddings  based  on 
partial  SVD,  while  incurring  much  lower  complexity.  Moreover,  our  method  can  also  approximate 
spectral  embeddings  which  depend  on  the  entire  SVD,  since  its  complexity  does  not  depend  on  the 
number  of  dominant  vectors  whose  effect  we  wish  to  model.  A  glimpse  of  this  potential  is  provided 
by  the  example  of  K -means  based  clustering  for  estimating  sparse-cuts  of  the  Amazon  graph,  where 
our  method  yields  much  better  performance  (using  graph  metrics)  than  a  partial  SVD  with  signifi¬ 
cantly  higher  complexity.  This  motivates  further  investigation  into  applications  of  this  approach  for 
improving  downstream  inference  tasks  in  a  variety  of  large-scale  problems. 
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